APPLICATION FOR 
UNITED STATES PATENT 



In the name of 



Ezra J. Rapoport 



For 

SPEECH COMPRESSION USING PRINCIPAL 
COMPONENT ANALYSIS 



Fish & Richardson P.C. 
225 Franklin Street 
Boston, MA 02110-2804 
Tel: (617) 542-5070 
Fax: (617) 542-8906 

ATTORNEY DOCKET: DATE OF DEPOSIT: July 21, 2003 



14501-002001 



EXPRESS MAIL NO. : EV 331002101 US 



Docket No.: 14501-002001 



SPEECH COMPRESSION USING PRINCIPAL COMPONENT ANALYSIS 

PRIORITY TO OTHER APPLICATIONS 

This application claims priority from and incorporates 
herein U.S. Provisional Application No. 60/428,551, filed 
November 21, 2002, and titled "Speech Compression Using 
Principal Component Analysis." 

BACKGROUND 

This invention relates to speech compression. 

In a typical communications system, a message is sent 
from a transmitter to a receiver over a channel. The rate at 
which the information is received by the receiver is limited 
by the bandwidth of the channel and the amount of information 
sent. One way to improve communications is to widen the 
bandwidth. However, in most situations, the bandwidth is fixed 
due to the infrastructure of wires, fiber optics, etc. 

Another way to improve the rate of information received 
is to compress the information. The ultimate goal of 
compression is to store data more efficiently by reducing the 
bandwidth required to transmit a given amount of information. 
Compression is also highly valuable for practical reasons, 
such as reducing costs associated with computer memory and 
other storage methods. 

SUMMARY 

Quasi-periodic waveforms can be found in many areas of 
the natural sciences. Quasi-periodic waveforms are observed in 
data ranging from heartbeats to population statistics, and 
from nerve impulses to weather patterns. The "patterns" in 
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the data are relatively easy to recognize. For example, 
nearly everyone recognizes the signature waveform of a series 
of heartbeats. However, programming computers to recognize 
these quasi-periodic patterns is difficult because the data 
5 are not patterns in the strictest sense because each quasi- 
periodic data pattern recurs in a slightly different form with 
each iteration. The slight pattern variation from one period 
to the next is characteristic of "imperf ect" natural systems. 
It is, for example, what makes human speech sound distinctly 

10 human. The inability of computers to efficiently recognize 

quasi-periodicity is a significant impediment to the analysis 
and storage of data from natural systems. Many standard 
methods require such data to be stored verbatim, which 
requires large amounts of storage space. Consequently, 

15 compression of quasi-periodic data has long been an evasive 
goal of scientists from diverse fields. 

In one aspect, the invention is a method for compressing 
data. The method includes parsing an input waveform into 
pitch segments; determining principal components of at least 

20 one pitch segment; and sending a subset of the determined 

principal components during an initial transmission period. 
The method also includes sending coefficients of the input 
waveform for each pitch segment during a period subsequent to 
the initial transmission period. 

25 In another aspect the invention is a method of receiving 

an input waveform. The method includes receiving a subset of 
determined principal components of at least one pitch segment 
during an initial transmission period and receiving 
coefficients of the input waveform for each pitch segment 

30 during a period subsequent to the initial transmission period. 
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In another aspect, the invention is an apparatus that 
includes a memory that stores executable instructions for 
compressing speech data. The apparatus also includes a 
processor that executes the instructions to parse an input 
5 waveform into pitch segments; to determine principal 

components of at least one pitch segment; and to send a subset 
of the determined principal components during an initial 
transmission period. The processor also executes instructions 
to send coefficients of the input waveform for each pitch 
10 segment during a period subsequent to the initial transmission 
period. 

In another aspect, the invention is an apparatus that 
includes a memory that stores executable instructions for 
receiving an input waveform. The apparatus also includes a 

15 processor that executes the instructions to receive a subset 
of determined principal components of at least one pitch 
segment during an initial transmission period; and to receive 
coefficients of the input waveform for each pitch segment 
during a period subsequent to the initial transmission period. 

20 In still another aspect, the invention is an article that 

includes a machine-readable medium that stores executable 
instructions for compressing speech data. The instructions 
cause a machine to parse an input waveform into pitch 
segments; to determine principal components of at least one 

25 pitch segment; and to send a subset of the determined 

principal components during an initial transmission period. 
The instructions also cause a machine to send coefficients of 
the input waveform for each pitch segment during a period 
subsequent to the initial transmission period. 

30 In another aspect, the invention is an article that 

includes a machine-readable medium that stores executable 
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instructions for receiving an input waveform. The 
instructions cause a machine to receive a subset of determined 
principal components of at least one pitch segment during an 
initial transmission period and to receive coefficients of the 
5 input waveform for each pitch segment during a period 
subsequent to the initial transmission period. 

One or more of the aspects may have one or more of the 
following advantages. The invention achieves compression 
rates that surpass the highest standards currently available. 
10 These increases in compression translate into savings of 

processing time and data storage. The method is also suitable 
for real-time applications such as telecommunications. For 
example, using only 3 kbps, the method allows for twenty 
conversations over a single phone line. 

15 

DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a telecommunications system. 
FIG. 2 is a flowchart of a process to compress speech. 
FIG. 3 is a flowchart of a process to determine a pitch 
20 period. 

FIG. 4 is an input waveform showing the relationship 
between vector length, buffer length and pitch periods. 

FIG. 5 is an amplitude versus time plot of a sampled 
waveform of a pitch period. 
25 FIGS. 6A-6C are plots representing a relationship between 

data and principal components. 

FIG. 7 is a flowchart of a process to determine principal 
components and coefficients. 

FIG. 8 is a plot of an eigenspectrum for a phoneme. 
30 FIG. 9 is a flowchart of a process to reconstruct 

waveforms . 
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FIGS. 10A-10C are plots of principal components. 

FIGS. 11A-11F are plots of reconstructed waveforms versus 
actual waveforms. 

FIG. 12 is a plot of distances of pitch periods from 
5 their centroid. 

FIGS. 13A-13D are graphs of the coefficients for the 
first four principal components of a waveform. 

FIGS. 14A-14B are plots of the same phoneme spoken in 
different surrounding environments . 
10 FIG. 15 is a flowchart of a process using principal 

component analysis (PCA) in speech recognition. 

FIG. 16 is a flowchart of a process using PCA in speech 
synthesis . 

FIG. 17 is a block diagram of a computer system on which 
15 the process of FIG. 2 may be implemented. 

DESCRIPTION 

Referring to FIG. 1, a telecommunications system 5 
includes a transmitter 10 that sends signals over a medium 11 

20 (e.g., network, atmosphere) to a receiver 40. Transmitter 10 
includes a microphone 12 for receiving an input signal, e.g., 
waveform A, a pitch track analyzer 14, a switch 16, a 
principal component analysis (PCA) generator 18 and a spacing 
coefficient generator 20. Principal component analysis (PCA) 

25 is a linear algebraic transform. PCA is used to determine the 
most efficient orthogonal basis for a given set of data. When 
determining the most efficient axes, or principal components 
of a set of data using PCA, a strength (i.e., an importance 
value called herein as a coefficient) is assigned to each 

30 principal component of the data set. 
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The pitch track analyzer 12 determines the pitch periods 
of the input waveform. A signal switch 16 routes the signal 
to the PCA generator 18 during an initial calibration period. 
PCA generator 18 calculates the principal components for the 
5 initial pitch period received. PCA Generator 18 sends the 
first 6 principal components for transmission. After the 
initial transmission period, switch 16 routes the input signal 
to coefficient generator 18, which generates coefficients for 
each subsequent pitch period. Instead of sending the 

10 principal components, only the coefficients are sent, thus 
reducing the number of bits being transmitted. Switch 16 
includes a mechanism that determines if the coefficients being 
used are valid. Coefficients deviating from the original 
coefficients by more than a predetermined value are rejected 

15 and new principal components are determined and hence new 
coefficients . 

Receiver 40 includes a storage device 42 for storing the 
principal components received from transmitter 10, a 
multiplier 46, an adder 48 and a transducer 50. Each set of 

20 principal components stored in storage 42 is coupled to a 

corresponding set of coefficients received from transmitter 
10. Each coupled product is summed by pitch period to 
generate an approximation of the waveform A. The result is 
sent to transducer 50. 

25 Referring to FIG. 2, as will be described below, 

telecommunications system 5 uses a process 60 to implement 
speech compression. Process 60 determines (62) the pitch 
period of the input waveform using a pitch tracking process 62 
(FIG. 3). Process 60 generates (64) PCA components and PCA 

30 coefficients using a principal components process 64 (FIG. 7) . 
Process 60 reconstructs (66) the input waveform received from 
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the PCA components and coefficients. Details of a waveform 
reconstruction process 66 will be described in FIG. 9. 

A. PITCH TRACKING 

5 Process 60 is one example of an implementation to use 

principal component analysis (PCA) to determine trends in the 
slight changes that modify a waveform across its pitch periods 
including quasi-periodic waveforms like speech signals. In 
order to analyze the changes that occur from one pitch period 

10 to the next, a waveform is divided into its pitch periods 
using pitch tracking process 62. 

Referring to FIGS. 3 and 4, pitch tracking process 62 
receives (68) an input waveform 75 to determine the pitch 
periods. Even though the waveforms of human speech are quasi- 

15 periodic, human speech still has a pattern that repeats for 
the duration of the input waveform 75. However, each 
iteration of the pattern, or "pitch period" (e.g., PPi) varies 
slightly from its adjacent pitch periods, e.g., PP 0 and PP 2 . 
Thus, the waveforms of the pitch periods are similar, but not 

20 identical, thus making the time duration for each pitch period 
unique . 

Since the pitch periods in a waveform vary in time 
duration, the number of sampling points in each pitch period 
generally differs and thus the number of dimensions required 

25 for each vectorized pitch period also differs. To adjust for 
this inconsistency, pitch tracking process 62 designates (70) 
a standard vector (time) length, V L . After pitch tracking 
process 62 is executing, the pitch tracking process chooses 
the vector length to be the average pitch period length plus a 

30 constant, for example, 40 sampling points. This allows for an 
average buffer of 20 sampling points on either side of a 
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vector. The result is all vectors are a uniform length and 
can be considered members of the same vector space. Thus, 
vectors are returned where each vector has the same length and 
each vector includes a pitch period. 
5 Pitch tracking process 62 also designates (72) a buffer 

(time) length, B L , which serves as an offset and allows the 
vectors of those pitch periods that are shorter than the 
vector length to run over and include sampling points from the 
next pitch period. As a result, each vector returned has a 

10 buffer region of extra information at the end. This larger 
sample window allows for more accurate principal component 
calculations, but also requires a greater bandwidth for 
transmission. In the interest of maximum bandwidth reduction, 
the buffer length may be kept to between 10 and 20 sampling 

15 . points (vector elements) beyond the length of the longest 
pitch period in the waveform. 

At 8 kHz, a vector length that includes 120 sample points 
and an offset that includes 20 sampling units can provide 
optimum results. 

20 Pitch tracking process 62 relies on the knowledge of the 

prior period duration, and does not determine the duration of 
the first period in a sample directly. Therefore, pitch 
tracking process 62 determines (74) an initial period length 
value by finding a real cepstrum of the first few pitch 

25 periods of the speech signal to determine the frequency of the 
signal. A cepstrum is an anagram of the word "spectrum" and 
is a mathematical function that is the inverse Fourier 
transform of the logarithm of the power spectrum of a signal. 
The cepstrum method is a standard method for estimating the 

30 fundamental frequency (and therefore period length) of a 
signal with fluctuating pitch. 
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A pitch period can begin at any point along a waveform, 
provided it ends at a corresponding point. Pitch tracking 
process 62 considers the starting point of each pitch period 
to be the primary peak or highest peak of the pitch period. 
5 Pitch tracking process 62 determines (76) the first 

primary peak 77. Pitch tracking process 62 determines a 
single peak by taking the input waveform, sampling the input 
waveform, taking the slope between each sample point and 
taking the point sampling point closest to zero. Pitch 

10 tracking process 62 searches several peaks and takes the peak 
with the largest magnitude as the primary peak 77. Pitch 
tracking process 62 adds (78) the prior pitch period to the 
primary peak. Pitch tracking process 62 determines (80) a 
second primary peak 81 locating a maximum peak from a series 

15 of peaks 79 centered a time period, P, (equal to the prior 

pitch period, PP 0 ) from the first primary peak 77. The peak 
whose time duration from the primary peak 77 is closest to the 
time duration of the prior pitch period PP 0 is determined to 
be the ending point of that period (PPi) and the starting 

20 point of the next (PPi) . The second primary peak is 

determined by analyzing three peaks before or three peaks 
after the prior pitch period from the primary peak and 
designating the largest peak of those peaks as the second 
peak. 

25 Process 60 vectorizes (84) the pitch period. Performing 

pitch tracking process 62 recursively, pitch tracking process 
62 returns a set of vectors; each set corresponding to a 
vectorized pitch period of the waveform. A pitch period is 
vectorized by sampling the waveform over that period, and 

30 assigning the ith sample value to the ith coordinate of a 

vector in Euclidean n-dimensional space, denoted by where 
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the index i runs from 1 to n, the number of samples per 
period. Each of these vectors is considered a point in the 
space 9? n . 

FIG. 5 shows an illustrative sampled waveform of a pitch 
5 period. The pitch period includes 82 sampling points (denoted 
by the dots lying on the waveform) and thus when the pitch 
period is vectorized, the pitch period can be represented as a 
single point in an 82-dimensional space. 



10 primary peak as the first primary peak of the subsequent pitch 
period and reiterates (78) -(86). 

Thus, pitch tracking process 62 identifies the beginning 
point and ending point of each pitch period. Pitch tracking 
process 62 also accounts for the variation of time between 

15 pitch periods. This temporal variance occurs over relatively 
long periods of time and thus there are no radical changes in 
pitch period length from one pitch period to the next. This 
allows pitch tracking process 62 to operate recursively, using 
the length of the prior period as an input to determine the 

20 duration of the next. 

Pitch tracking process 62 can be stated as the following 
recursive function : 



The function f(p,p') operates on pairs of consecutive 
peaks p and p' in a waveform, recurring to its previous value 
(the duration of the previous pitch period) until it finds the 
peak whose location in the waveform corresponds best to that 



Pitch tracking process 62 designates (86) the second 




of the first peak in the waveform. This peak becomes the 
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if (y+d-1) > 0 

period (y) = wave (y+d-1); 

end 

end 

5 a(n, :) = period; % generate vector 

prior = trail-left; % of pitch period 

left = trail; 
end 

10 Of course, other code (or even hardware) may be used to 
implement pitch tracking process 62. 

B. Principal Component Analysis 

Principal component analysis is a method of calculating 

15 an orthogonal basis for a given set of data points that 
defines a space in which any variations in the data are 
completely uncorrelated. The symbol, "91 n " is defined by a set 
of n coordinate axes, each describing a dimension or a 
potential for variation in the data. Thus, n coordinates are 

20 required to describe the position of any point. Each 

coordinate is a scaling coefficient along the corresponding 
axis, indicating the amount of variation along that axis that 
the point possesses. An advantage of PCA is that a trend 
appearing to span multiple dimensions in 9? n can be decomposed 

25 into its "principal components , " i.e., the set of eigen-axes 
that most naturally describe the underlying data. By 
implementing PCA, it is possible to effectively reduce the 
number of dimensions. Thus, the total amount of information 
required to describe a data set is reduced by using a single 

30 axis to express several correlated variations. 

For example, FIG. 6A shows a graph of data points in 3- 
dimensions. The data in FIG. 6B are grouped together forming 
trends. FIG. 6B shows the principal components of the data in 
FIG. 6A. FIG. 6C shows the data redrawn in the space 



-12- 



Docket No.: 14501-002001 

determined by the orthogonal principal components. There is 
no visible trend in the data in FIG . 6C as opposed to FIGS. 6A 
and 6B. In this example, the dimensionality of the data was 
not reduced because of the low-dimensionality of the original 
5 data. For data in higher dimensions, removing the trends in 
the data reduces the data's dimensionality by a factor of 
between 20 and 30 in routine speech applications. Thus, the 
purpose of using PCA in this method of compressing speech is 
to describe the trends in the pitch-periods and to reduce the 

10 amount of data required to describe speech waveforms. 

Referring to FIG. 7, principal components process 64 
determines (92) the number of pitch periods generated from 
pitch tracking process 62. Principal components process 64 
generates (94) a correlation matrix. 

15 The actual computation of the principal components of a 

waveform is a well-defined mathematical operation, and can be 
understood as follows. Given two vectors x and y, xy r is the 
square matrix obtained by multiplying x by the transpose of y. 
Each entry [xy T ] ± f j is the product of the coordinates xx and yj. 

20 Similarly, if X and Y are matrices whose rows are the vectors 
Xi and Yjf respectively, the square matrix XY T is a sum of 
matrices of the form [xy T ]i,j: 

XY r = £ xjy'j. 

XY r can therefore be interpreted as an array of 
25 correlation values between the entries in the sets of vectors 
arranged in X and Y. So when X=Y, XX r is an "autocorrelation 
matrix, " in which each entry [XX T ] ir j gives the average 
correlation (a measure of similarity) between the vectors xi 
and Xj. The eigenvectors of this matrix therefore define a set 
30 of axes in SR n corresponding to the correlations between the 
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vectors in X. The eigen-basis is the most natural basis in 
which to represent the data, because its orthogonality implies 
that coordinates along different axes are uncorrelated, and 
therefore represent variation of different characteristics in 
5 the underlying data. 

Principal components process 64 determines (96) the 
principal components from the eigenvalue associated with each 
eigenvector. Each eigenvalue measures the relative importance 
of the different characteristics in the underlying data. 

10 Process 64 sorts (98) the eigenvectors in order of decreasing 
eigenvalue, in order to select the several most important 
eigen-axes or "principal components" of the data. 

Principal components process 64 determines (100) the 
coefficients for each pitch period. The coordinates of each 

15 pitch period in the new space are defined by the principal 

components. These coordinates correspond to a projection of 
each pitch period onto the principal components. Intuitively, 
any pitch period can be described by scaling each principal 
component axis by the corresponding coefficient for the given 

20 pitch period, followed by performing a summation of these 
scaled vectors. Mathematically, the projections of each 
vectorized pitch period onto the principal components are 
obtained by vector inner products: 

n 

x' = J] (erx)ei. 

i=i 

25 In this notation, the vectors x and x' denote a 

vectorized pitch period in its initial and PCA 
representations, respectively. The vectors ei are the ith 
principal components, and the inner product ei-x is the 
scaling factor associated with the ith principal component. 
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Therefore, if any pitch period can be described simply by 
the scaling and summing the principal components of the given 
set of pitch periods, then the principal components and the 
coordinates of each period in the new space are all that is 
5 needed to reconstruct any pitch period and thus the principal 
components and coefficients are the compressed form of the 
original speech signal. In order to reconstruct any pitch 
period of n sampling points, n principal components are 
necessary. 

10 In the present case, the principal components are the 

eigenvectors of the matrix SS T r where the ith row of the 
matrix S is the vectorized ith pitch period in a waveform. 
Usually the first 5 percent of the principal components can be 
used to reconstruct the data and provide greater than 97 

15 percent accuracy. This is a general property of quasi- 
periodic data. Thus, the present method can be used to find 
patterns that underlie quasi-periodic data, while providing a 
concise technique to represent such data. By using a single 
principal component to express correlated variations in the 

20 data, the dimensionality of the pitch periods is greatly 

reduced. Because of the patterns that underlie the quasi- 
periodicity, the number of orthogonal vectors required to 
closely approximate any waveform is much smaller than is 
apparently necessary to record the waveform verbatim. 

25 FIG. 8 shows an eigenspectrum for the principal 

components of the 'aw 1 phoneme. The eigenspectrum displays the 
relative importance of each principal component in the A aw' 
phoneme. Here only the first 15 principal components are 
displayed. The steep falloff occurs far to the left on the 

30 horizontal axis. This indicates the importance of later 

principal components is minimal. Thus, using between 5 and 10 



-15- 



Docket No.: 14501-002001 



principal components would allow reconstruction of more than 
95% of the original input signal. The optimum tradeoff 
between accuracy and number of bits transmitted typically 
requires six principal components. Thus, the eigenspectrum is 
5 a useful tool in determining how many principal components are 
required for the compression of a given phoneme (speech 
sound) . 

A representative example of program code (i.e., machine- 
executable instructions) to implement principal components 
10 process 64 is the following code using MATHLAB: 

function [v,c] = pea (periodarray, Nvect) 
% PCA principal component analysis 

% pea (periodarray) performs principal component analysis on an 
15 % array where each row is an observation (pitch-period) and 

% each column a variable. 

n = size (periodarray) ; % find # of pitch periods 

n = n(l) ; 

20 1 - size (periodarray (1, :)) ; 

v = zeros (Nvect, 1(2)); 
c = zeros (Nvect, n) ; 

e = cov (periodarray) ; % generate correlation matrix 

25 [vects, d] - eig(e); % compute principal components 

vals = diag (d) ; 

for x = 1: Nvect % order principal components 

y = 1; 

30 while vals(y) ~= max (vals); 

y = y + 1; 

end 

vals(y) = -1; 

v(x,:) = vects (:,y)'; % compute coefficients for 

35 for z = l:n % each period 

c(x,z) = dot(v(x, :), periodarray (z, :)) ; 

end 

end 

40 Of course, other code (or even hardware) may be used to 

implement principal components process 64. After using pitch 
tracking process 62 and principal components process 64, the 
input waveform is considered to be a compressed waveform where 
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the principal components and their coefficients are the 
compressed waveform. 

C. WAVEFORM RECONSTRUCTION 

5 Waveform reconstruction process 66 synthesizes the 

waveform by sequentially reconstructing each pitch period by 
scaling the principal components by their coefficients for a 
given period and summing the scaled components. As each pitch 
period is reconstructed, the pitch period is concatenated to 

10 the prior pitch period to reconstruct the waveform. To 
decrease the bit rate necessary for this compression 
technique, only a small number of principal components are 
used to compress the signal. As a result the reconstructed 
waveforms are slightly different from the originals, and so a 

15 smoothing filter can be used in the concatenation process to 
smooth over small inconsistencies. A trapezoidal smoothing 
filter known as an alpha-blending filter can be used. 

The principal components of a set of pitch periods are, 
in essence, vectors in the same dimensional space as the 

20 vectorized pitch periods. Thus, since each of the points in 
space representing a pitch period has the same number of 
coordinates as one of the axes that defines that space (the 
principal components) , each principal component itself is a 
waveform of a length of each of the pitch-period-length 

25 vectors. 

Waveform reconstruction process 66 sets (120) the buffer 
length for the smoothing filter. Waveform reconstruction 
process 66 scales (122) the principal components and sums 
(124) the principal components and uses (126) the smoothing 
30 filter to reconstruct (128) the input waveform. 
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FIGS. 10A-10C show the waveform representations of the 
first three principal components generated from a set of pitch 
periods. These vectors need only be scaled by the proper 
coefficients and summed together to reconstruct any pitch 
5 period in the waveform. 

Referring to FIGS. 11A-11F, in each of these figures, an 
additional principal component has been scaled and added to 
the prior figure to construct a closer approximation 127 to 
the actual waveform 129 so that FIG. 11A includes only one 

10 principal component, whereas FIG. 11F includes six principal 
components. Therefore, it is possible to reconstruct any 
pitch period with relatively high accuracy with a small number 
of principal components and their corresponding coefficients 
for each pitch period. The reconstructed pitch periods may 

15 differ slightly from the periods that generated them because 
not all of the principal components were used, and thus, when 
the pitch periods are concatenated, a slight discontinuity may 
occur at the point where one pitch period ends and the next 
begins. This discontinuity is eliminated using alpha-blending 

20 filter. 

A representative example of program code (i.e., machine- 
executable instructions) to implement waveform reconstruction 
process 66 is the following code using MATHLAB: 

25 function w = pes (pemtx, coeffmtx, times) 

% PCS principal component synthesis 

% pcs<pcmtx, coeffmtx, times) returns a synthesized wave (w) ) 

d = size (times) ; 
30 s = size(pcmtx); 

Nvect = s ( 1) ; 
n = d(l) ; 
v = 0; 

buffer = times (1,1); % set buffer length for 

35 c = buffer+1; % smoothing filter (alpha 

% blend) 

for x = l:n % determine length of 

v = v+ (times (x, 2) -times (x, 1) ) ; % reconstructed wave 
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end 

w = zeros (1, v+c) ; 

for x = l:n % scale and sum principal 

t = 0; % components for a single 

5 for y = l:Nvect % pitch period 

t = t + pcmtx (y, : ) *coef fmtx (y, x) ; 

end 

bcount - buffer; 
for z = 1 : (times (x, 2) ) 
10 w (c-buf fer*x) = ((w(c~ % alpha blend and build wave 

buf fer*x) ) * (bcount/buf f er) ) 
+( (t (z) )* ( (buffer-bcount) /buffer) ) ; 
c = c+1; 
if bcount>0 

15 bcount = bcount -1; 

end 

end 

end 



20 Of course , other code (or even hardware) may be used to 
implement waveform reconstruction process 66. 

The speech coding standards in digital cellular 
applications (the most bandwidth restrictive voice 
transmission protocols) range from 13 kb/s to 3.45 kb/s. That 

25 is, a speech waveform transmitted raw at 64 kb/s (8-bit 

samples at 8 kHz) can be compressed to a 3.45 kb/s signal. The 
method for compressing speech discussed here if applied to 
individual voice vowel phonemes can achieved compression to 
rates of 3.2 kb/s with highly accurate reconstruction. 

30 This speech compression technique is useful for real-time 

speech coding applications. In any real-time application, this 
technique is paired with a technique of determining phoneme 
(speech sound) changes because maximum compression is achieved 
when a set of principal components is calculated for a single 

35 phoneme . 

Any real-time speech-coding technique involves delay. The 
algorithmic delay of this technique of speech coding depends 
on the number of pitch periods used to calculate the principal 
components that will be used to code for the entire phoneme. 
40 If the principal components were calculated from all of the 
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pitch periods in a speech sample, the algorithmic delay could 
be too long to accommodate real-time communication. Thus the 
principal components for a phoneme are calculated only from 
the first few pitch periods of the sample. The pitch periods 
5 for a given phoneme are similar, so the principal components 
calculated from the first pitch periods will suffice to code 
for the next pitch periods for a short period time. However, 
if the pitch periods change, or if the phoneme being spoken 
changes, the principal components are recalculated to 

10 represent that phoneme effectively. 

One effective way of determining how well a set of 
principal components can describe a given pitch period is to 
calculate the distance of that pitch period from the centroid 
of the data that generated the pitch periods. The farther 

15 from the centroid a given pitch period is, the lesser the 
ability of a small number of principal components to 
reconstruct that pitch period accurately. 

Mathematically, the centroid of a set of vectors in 9T n as 
an unweighted average position is defined as: 



That is, where ri are the k given position vectors of the 
data, and r c is the position vector of the centroid. The n- 
dimensional distance of a point x from the centroid is 
therefore given by 



For example, FIG. 12 shows the distance of a set of pitch 
periods from their centroid. As time progresses, the ability 
of the principal components to effectively code for the pitch 



20 




2 • 
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periods decreases. Thus, at a certain threshold 130, the 
principal components are recalculated. 

The point at which the principal components should be 
recalculated is a tolerance issue. The more often the 
5 principal components are recalculated, the better the quality 
of the reproduced speech. However, frequent recalculation of 
the principal components causes a higher bit rate for 
transmission of the coded speech. Thus, the tolerance for 
noise must be balanced with the bit rate constraints placed on 

10 the coding method by the channel across which the transmission 
is to take place. 

When implemented in real-time, this coding technique will 
not produce a constant stream of data. A surge of data will be 
initially transmitted. This surge is comprised of the 

15 principal components of the upcoming phoneme. The principal 
components will be followed by a low bit rate stream of the 
coefficients for each pitch period in real-time as it is 
spoken. At the point where the principal components no longer 
suffice, a new set of principal components are calculated and 

20 transmitted, causing another surge in the bit rate of the 

transmission to be followed by a long stream of coefficients, 
and so on. The coefficients require much less bandwidth for 
transmission, and thus the data stream will be a series of 
short bit rate surges followed by long, low-bit rate data 

25 streams. 

REDUCING THE BIT RATE 

Techniques can be used to reduce the bit rate required 
for speech transmission even further with the above approach 
30 to speech compression. One technique would use a linear 

predictive-type method of reducing the bit rate required by 
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the principal component coefficients. Since the coefficients 
for given principal components follow trends over time, it may 
be possible for the receiving end of the transmission to 
predict the next values of the coefficients of the principal 
5 components and thus guess the shape of the next pitch period. 
This prediction would reduce the amount of data needed for 
transmission by requiring only an occasional corrective value 
to be transmitted if the predicted value is inaccurate, as 
opposed to transmitting every coefficient. Another technique 

10 can be used to eliminate artifacts remaining in the waveforms 
after compression. The artifact arises because the waveform of 
each pitch period contains a great deal of information about 
the acoustic settings in which the sound was spoken. If this 
information can be removed prior to coding, it will greatly 

15 reduce the bit rate of transmission. 

A. Coefficient Prediction 

Audible changes in a waveform across pitch periods occur 
slowly, over relatively long periods of time. The coefficients 

20 of the principal components for each pitch period describe the 
constantly occurring variations and indicate how much of each 
variation their respective pitch period contains. Thus, the 
coefficients for a given principal component over a series of 
pitch periods generally show very slow, definite trends. 

25 FIGS. 13A-13D show the values of the coefficients for the 

first four principal components in a set over time. The 
definite trends depicted in these four principal components 
would make prediction of the coefficient values possible. 
Being able to predict the coefficient values would 

30 greatly increase the compression ratio and could reduce the 

bit rate necessary for transmission by a factor of 10 1 or even 
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as high as 10 2 for signals with particularly predictable 
trends. This notion of a meta-trend, as distinct from the 
individual correlations that make PCA possible, is a general 
property of quasi-periodic waveforms, and is not particular to 
5 human speech. 



B. Eliminating Artifacts 

The primary purpose of speech compression is to convey 
the message contained in the speech signal while using the 

10 least amount of bandwidth. Thus, the accuracy of the phonemes 
is of greatest importance. The acoustic surroundings of the 
speaker (echo and background noise, for instance) are of much 
less importance and can even prove annoying in extreme cases. 
Referring to FIG. 14, two waveforms of the same phoneme 

15 spoken in different acoustic settings may contain different 
shapes and attributes. The different shapes of the waveforms 
indicate that the waveforms contain information describing the 
acoustic setting. A microphone in constant motion thus may 
register very different signals over time as a result of the 

20 constantly changing background despite the fact that the 

phoneme being spoken may not have changed. Thus process 60 
may be modified to recalculate principal components to adjust 
for changing acoustics. This recalculation increases the bit 
rate required for transmission. If these artifacts can be 

25 removed prior to coding, the bit rate of transmission can be 
further reduced. 



SPEECH RECOGNITION 

Referring to FIG. 15, in some embodiments, PCA can be 
30 implemented in speech recognition applications such as using a 
process 300, for example. After receiving a speech waveform 
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spoken from a speaker, process 300 isolates (302) the pitch 
periods using process 62, for example. Process 300 performs 
(306) a principal component analysis from the pitch periods to 
generate the principal components by using process 64, for 
5 example. Process 300 compares (308) the principal components 
from a library of the speaker's principal components 312, 
previously stored, with the principal components derived from 
the speech waveform. If the principal components are 
identical, process 300 generates phonemes. Process 300 
10 converts (316) the phonemes spoken to text. 

SPEECH SYNTHESIS 

Referring to FIG. 16, in other embodiments, PCA can be 
implemented in speech synthesis applications such as using a 

15 process 400, for example. Process 400 generates (404), based 
on a text input, phonemes. Process 400 sums (408) principal 
components from a library of principal components for a 
speaker and a set of coefficients from a user's speech pattern 
and combines them to form natural speech. In some 

20 embodiments, prior to combining the coefficients, process 400 
codes (416) the intonations of the speaker's speech pattern. 
For example, intonations such as a deep voice or a soft pitch 
can be reflected in the coefficients. These intonations can 
be selected by the user. 

25 FIG. 17 shows a computer 500 for speech compression using 

process 60. Computer 500 includes a processor 502, a memory 
504, a storage medium 506 (e.g., read only memory, flash 
memory, disk etc.), transmitter 10 for sending signal to a 
second computer (not shown) and receiver 40 to decompress a 

30 signal received from the second computer. In one 

implementation the computer is part of a cell phone. The 
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computer can be a general purpose or special purpose computer, 
e.g., controller, digital signal processor, etc. Storage 
medium 506 stores operating system 510, data 512 for speech 
compression, and computer instructions 514 which are executed 
5 by processor 502 out of memory 504 to perform process 60. 

Process 60 is not limited to use with the hardware and 
software of FIG. 17; it may find applicability in any 
computing or processing environment and with any type of 
machine that is capable of running a computer program. 

10 Process 60 may be implemented in hardware, software, or a 
combination of the two. For example, process 60 may be 
implemented in a circuit that includes one or a combination of 
a processor, a memory, programmable logic and logic gates. 
Process 60 may be implemented in computer programs executed on 

15 programmable computers/machines that each includes a 

processor, a storage medium or other article of manufacture 
that is readable by the processor (including volatile and non- 
volatile memory and/or storage elements), at least one input 
device, and one or more output devices. Program code may be 

20 applied to data entered using an input device to perform 
process 60 and to generate output information. 

Each such program may be implemented in a high level 
procedural or object-oriented programming language to 
communicate with a computer system. However, the programs can 

25 be implemented in assembly or machine language. The language 
may be a compiled or an interpreted language. Each computer 
program may be stored on a storage medium or device (e.g., CD- 
ROM, hard disk, or magnetic diskette) that is readable by a 
general or special purpose programmable computer for 

30 configuring and operating the computer when the storage medium 
or device is read by the computer to perform process 60. 
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Process 60 may also be implemented as a machine-readable 
storage medium, configured with a computer program, where upon 
execution, instructions in the computer program cause the 
computer to operate in accordance with process 60. 

5 The processes are not limited to the specific embodiments 

described herein. For example, the processes are not limited 
to the specific processing order of FIGS. 2, 3, 7, 9, 15 and 
16. Rather, the blocks of FIGS. 2, 3, 7, 9, 15 and 16 may be 
re-ordered, as necessary, to achieve the results set forth 

10 above. 

Other embodiments not described herein are also within 
the scope of the following claims. 
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